CNN Accelerator


Efficient Implementation of Convolution Functions on FPGA Using Parameterizable Blocks and Polynomial Approximations

Magalhães, Philippe, Fresse, Virginie, Suffran, Benoît, Alata, Olivier

arXiv.org Artificial Intelligence

Implementing convolutional neural networks (CNNs) on field-programmable gate arrays (FPGAs) has emerged as a promising alternative to GPUs, offering lower latency, greater power efficiency, and greater flexibility. However, this development remains complex due to the hardware knowledge required and the long synthesis, placement, and routing stages, which slow down design cycles and prevent rapid exploration of network configurations, making resource optimization under severe constraints particularly challenging. This paper proposes a library of configurable convolution blocks designed to optimize FPGA implementation and adapt to available resources. It also presents a methodological framework for developing mathematical models that predict FPGA resource utilization. The approach is validated through correlation analysis of the parameters, followed by error metrics. The results show that the designed blocks enable adaptation of convolution layers to hardware constraints, and that the models accurately predict resource consumption, providing a useful tool for FPGA selection and optimized CNN deployment.
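
As a rough, hypothetical sketch of the resource-modeling idea (the paper's actual model form and coefficients are not given here), one could fit a least-squares model that predicts a resource count from block parameters; the parameter names, sample data, and the `predict_luts` helper below are illustrative assumptions.

```python
# Hypothetical sketch: fit a least-squares model that predicts FPGA
# resource usage (here, LUTs) from convolution-block parameters.
# Parameter names, sample data, and coefficients are illustrative only.
import numpy as np

# Each row: (kernel_size, input_channels, bit_width); target: synthesized LUTs.
params = np.array([
    [3, 16,  8],
    [3, 32,  8],
    [5, 16,  8],
    [5, 32, 16],
    [7, 64, 16],
], dtype=float)
luts = np.array([1200, 2300, 2900, 6100, 15800], dtype=float)

# Fit luts ~ w0 + w1*kernel_size + w2*in_channels + w3*bit_width.
X = np.hstack([np.ones((len(params), 1)), params])
w, *_ = np.linalg.lstsq(X, luts, rcond=None)

def predict_luts(kernel_size, in_channels, bit_width):
    """Predict LUT usage for a candidate block before running synthesis."""
    return float(np.array([1.0, kernel_size, in_channels, bit_width]) @ w)

print(predict_luts(5, 64, 8))  # quick feasibility check against a target FPGA
```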


FPGA-Based Neural Network Accelerators for Space Applications: A Survey

Antunes, Pedro, Podobas, Artur

arXiv.org Artificial Intelligence

Space missions are becoming increasingly ambitious, necessitating high-performance onboard spacecraft computing systems. In response, field-programmable gate arrays (FPGAs) have garnered significant interest due to their flexibility, cost-effectiveness, and radiation tolerance potential. Concurrently, neural networks (NNs) are being recognized for their capability to execute space mission tasks such as autonomous operations, sensor data analysis, and data compression. This survey serves as a valuable resource for researchers aiming to implement FPGA-based NN accelerators in space applications. By analyzing existing literature, identifying trends and gaps, and proposing future research directions, this work highlights the potential of these accelerators to enhance onboard computing systems.


A 10.60 $\mu$W 150 GOPS Mixed-Bit-Width Sparse CNN Accelerator for Life-Threatening Ventricular Arrhythmia Detection

Qin, Yifan, Jia, Zhenge, Yan, Zheyu, Mok, Jay, Yung, Manto, Liu, Yu, Liu, Xuejiao, Wen, Wujie, Liang, Luhong, Cheng, Kwang-Ting Tim, Hu, X. Sharon, Shi, Yiyu

arXiv.org Artificial Intelligence

This paper proposes an ultra-low power, mixed-bit-width sparse convolutional neural network (CNN) accelerator to accelerate ventricular arrhythmia (VA) detection. The chip achieves 50% sparsity in a quantized 1D CNN using a sparse processing element (SPE) architecture. Measurements of a prototype chip fabricated in a TSMC 40nm CMOS low-power (LP) process show that, for the VA classification task, it consumes 10.60 $\mu$W of power while achieving a performance of 150 GOPS and a diagnostic accuracy of 99.95%. The computation power density is only 0.57 $\mu$W/mm$^2$, which is 14.23X smaller than state-of-the-art works, making it highly suitable for implantable and wearable medical devices.
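
The zero-skipping principle behind a sparse processing element can be sketched in software: store only the nonzero taps of a pruned kernel and never issue multiplies for the zeros. The sketch below is a plain-Python model of that dataflow, not the chip's SPE; the `sparse_conv1d` function and sample data are hypothetical.

```python
# Plain-Python model of the zero-skipping dataflow behind a sparse
# processing element: store only nonzero taps of a pruned 1D kernel and
# never issue multiplies for the zeros. Function and data are hypothetical.
import numpy as np

def sparse_conv1d(signal, weights):
    """1D 'valid' convolution that accumulates only over nonzero taps."""
    nz_idx = np.flatnonzero(weights)   # positions of surviving weights
    nz_w = weights[nz_idx]
    out_len = len(signal) - len(weights) + 1
    out = np.zeros(out_len)
    for t in range(out_len):
        # 50% sparsity => roughly half the multiply-accumulates are issued.
        out[t] = np.dot(signal[t + nz_idx], nz_w)
    return out

ecg = np.random.randn(256)                        # stand-in for an ECG window
kernel = np.array([0.5, 0.0, -0.25, 0.0, 0.125])  # pruned 5-tap kernel
print(sparse_conv1d(ecg, kernel)[:4])
```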


PENDRAM: Enabling High-Performance and Energy-Efficient Processing of Deep Neural Networks through a Generalized DRAM Data Mapping Policy

Putra, Rachmad Vidya Wicaksana, Hanif, Muhammad Abdullah, Shafique, Muhammad

arXiv.org Artificial Intelligence

Convolutional Neural Networks (CNNs), a prominent type of Deep Neural Networks (DNNs), have emerged as a state-of-the-art solution for solving machine learning tasks. To improve the performance and energy efficiency of CNN inference, the employment of specialized hardware accelerators is prevalent. However, CNN accelerators still face performance- and energy-efficiency challenges due to high off-chip memory (DRAM) access latency and energy, which are especially crucial for latency- and energy-constrained embedded applications. Moreover, different DRAM architectures have different profiles of access latency and energy, making it challenging to optimize them for high-performance and energy-efficient CNN accelerators. To address this, we present PENDRAM, a novel design space exploration methodology that enables high-performance and energy-efficient CNN acceleration through a generalized DRAM data mapping policy. Specifically, it explores the impact of different DRAM data mapping policies and DRAM architectures, across different CNN partitioning and scheduling schemes, on DRAM access latency and energy, and then identifies the Pareto-optimal design choices. The experimental results show that our DRAM data mapping policy improves the energy-delay-product of DRAM accesses in the CNN accelerator over other mapping policies by up to 96%. In this manner, our PENDRAM methodology offers high-performance and energy-efficient CNN acceleration under any given DRAM architecture for diverse embedded AI applications.
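
A minimal sketch of the exploration step described above: score each (mapping policy, DRAM architecture) pair by the energy-delay product (EDP) of its DRAM accesses and keep the Pareto-optimal choices. The policy names and numbers below are invented for illustration and are not PENDRAM's actual design points.

```python
# Toy version of the exploration step: score each (mapping policy, DRAM
# architecture) pair by the energy-delay product (EDP) of its DRAM accesses
# and keep the Pareto-optimal designs. All names and numbers are invented.
candidates = {
    # (policy, DRAM): (total access latency in cycles, energy in nJ)
    ("row-major",  "DDR3"): (4.2e6, 910.0),
    ("bank-first", "DDR3"): (3.1e6, 780.0),
    ("row-major",  "HBM2"): (1.9e6, 950.0),
    ("bank-first", "HBM2"): (1.4e6, 700.0),
}

def pareto_front(points):
    """Keep designs not dominated in both latency and energy."""
    front = []
    for name, (lat, en) in points.items():
        dominated = any(l2 <= lat and e2 <= en and (l2, e2) != (lat, en)
                        for l2, e2 in points.values())
        if not dominated:
            front.append((name, lat, en, lat * en))  # last field is the EDP
    return sorted(front, key=lambda r: r[3])

for name, lat, en, edp in pareto_front(candidates):
    print(name, f"EDP={edp:.3e}")
```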


SAfEPaTh: A System-Level Approach for Efficient Power and Thermal Estimation of Convolutional Neural Network Accelerator

Chen, Yukai, Yang, Simei, Bhattacharjee, Debjyoti, Catthoor, Francky, Mallik, Arindam

arXiv.org Artificial Intelligence

The design of energy-efficient, high-performance, and reliable Convolutional Neural Network (CNN) accelerators involves significant challenges due to complex power and thermal management issues. This paper introduces SAfEPaTh, a novel system-level approach for accurately estimating power and temperature in tile-based CNN accelerators. Unlike traditional methods, it eliminates the need for circuit-level simulations or on-chip measurements. Through rigorous simulation results using the ResNet18 model, we demonstrate SAfEPaTh's capability to accurately estimate power and temperature within 500 seconds, encompassing CNN model-to-accelerator mapping exploration and detailed power and thermal estimation. Furthermore, SAfEPaTh's adaptability extends its utility across various CNN models and accelerator architectures, underscoring its broad applicability in the field. This study contributes significantly to the advancement of energy-efficient and reliable CNN accelerator designs, addressing critical challenges in dynamic power and thermal management.

INTRODUCTION. Convolutional Neural Network (CNN) accelerators have become essential for many modern applications, such as autonomous driving, image and speech recognition, and natural language processing, all of which demand low power consumption, high performance, and high reliability [1]. However, as semiconductor scaling approaches its physical limits, managing dynamic power and thermal issues in accelerators executing CNN models poses increasing challenges.
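
One way to picture such a system-level estimate (a hedged sketch, not SAfEPaTh's actual models): sum per-tile static and utilization-scaled dynamic power, then advance a first-order thermal RC model. All constants and function names below are assumptions.

```python
# Hedged sketch (not SAfEPaTh's actual models): estimate accelerator power
# as per-tile static + utilization-scaled dynamic power, then advance a
# first-order thermal RC model. All constants below are assumptions.
def tile_power(utilization, p_dyn_peak=0.8, p_static=0.05):
    """Power (W) of one tile: leakage plus utilization-scaled dynamic power."""
    return p_static + utilization * p_dyn_peak

def thermal_step(temp_c, power_w, dt_s, r_th=2.0, c_th=5.0, ambient_c=45.0):
    """One Euler step of C_th * dT/dt = P - (T - T_amb) / R_th."""
    return temp_c + dt_s * (power_w - (temp_c - ambient_c) / r_th) / c_th

# Hypothetical per-layer tile utilizations from some mapping of ResNet18.
utilizations = [0.9, 0.7, 0.95, 0.4]
temp = 45.0
for u in utilizations:
    p_total = sum(tile_power(u) for _ in range(16))  # 16 identical tiles
    temp = thermal_step(temp, p_total, dt_s=0.01)
print(f"power={p_total:.2f} W, temperature={temp:.2f} C")
```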


SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs

Vatsavai, Sairam Sri, Karempudi, Venkata Sai Praneeth, Thakkar, Ishan, Salehi, Ahmad, Hastings, Todd

arXiv.org Artificial Intelligence

The acceleration of a CNN inference task uses convolution operations that are typically transformed into vector-dot-product (VDP) operations. Several photonic microring resonator (MRR) based hardware architectures have been proposed to accelerate integer-quantized CNNs with remarkably higher throughput and energy efficiency compared to their electronic counterparts. However, the existing photonic MRR-based analog accelerators exhibit a very strong trade-off between the achievable input/weight precision and VDP operation size, which severely restricts their achievable VDP operation size for quantized input/weight precision of 4 bits and higher. The restricted VDP operation size ultimately suppresses computing throughput and severely diminishes the achievable performance benefits. To address this shortcoming, we present, for the first time, a merger of stochastic computing and MRR-based CNN accelerators. To leverage the innate precision flexibility of stochastic computing, we invent an MRR-based optical stochastic multiplier (OSM). We employ multiple OSMs in a cascaded manner using dense wavelength division multiplexing to forge a novel Stochastic Computing based Optical Neural Network Accelerator (SCONNA). SCONNA achieves significantly high throughput and energy efficiency for accelerating inferences of high-precision quantized CNNs. Our evaluation for the inference of four modern CNNs at 8-bit input/weight precision indicates that SCONNA provides improvements of up to 66.5x, 90x, and 91x in frames-per-second (FPS), FPS/W, and FPS/W/mm$^2$, respectively, on average over two photonic MRR-based analog CNN accelerators from prior work, with a Top-1 accuracy drop of only up to 0.4% for large CNNs and up to 1.5% for small CNNs. We developed a transaction-level, event-driven Python-based simulator for the evaluation of SCONNA and other accelerators (https://github.com/uky-UCAT/SC_ONN_SIM.git).
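
For background on the stochastic-computing principle the OSM exploits: in unipolar stochastic computing, a value in [0, 1] is encoded as the probability of a 1 in a random bitstream, and a bitwise AND of two independent streams encodes the product. The Python sketch below mirrors that principle in software; SCONNA realizes it optically with MRRs.

```python
# Unipolar stochastic computing: a value in [0, 1] is the probability of a
# 1 in a random bitstream, and AND-ing two independent streams encodes the
# product. Software mirror of the principle; SCONNA realizes it optically.
import numpy as np

rng = np.random.default_rng(0)

def to_stream(value, length):
    """Encode value in [0, 1] as a random bitstream with P(bit=1) = value."""
    return rng.random(length) < value

def sc_multiply(a, b, length=4096):
    """Approximate a*b as the mean of the AND of two stochastic streams."""
    return float(np.mean(to_stream(a, length) & to_stream(b, length)))

print(sc_multiply(0.75, 0.5))  # ~0.375; longer streams give finer precision
```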


Photonic Reconfigurable Accelerators for Efficient Inference of CNNs with Mixed-Sized Tensors

Vatsavai, Sairam Sri, Thakkar, Ishan G

arXiv.org Artificial Intelligence

Photonic Microring Resonator (MRR) based hardware accelerators have been shown to provide disruptive speedup and energy-efficiency improvements for processing deep Convolutional Neural Networks (CNNs). However, previous MRR-based CNN accelerators fail to provide efficient adaptability for CNNs with mixed-sized tensors. One example of such CNNs is depthwise separable CNNs. Performing inferences of CNNs with mixed-sized tensors on such inflexible accelerators often leads to low hardware utilization, which diminishes the achievable performance and energy efficiency from the accelerators. In this paper, we present a novel way of introducing reconfigurability in MRR-based CNN accelerators, to enable dynamic maximization of the size compatibility between the accelerator hardware components and the CNN tensors processed on them. We classify the state-of-the-art MRR-based CNN accelerators from prior works into two categories, based on the layout and relative placements of the utilized hardware components in the accelerators. We then use our method to introduce reconfigurability in accelerators from these two classes, to consequently improve their parallelism, the flexibility of efficiently mapping tensors of different sizes, speed, and overall energy efficiency. We evaluate our reconfigurable accelerators against three prior works under an area-proportionate comparison (equal hardware area for all accelerators). Our evaluation for the inference of four modern CNNs indicates that our reconfigurable CNN accelerators provide improvements of up to 1.8x in Frames-Per-Second (FPS) and up to 1.5x in FPS/W, compared to an MRR-based accelerator from prior work.
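
The utilization problem can be illustrated with a toy model: mapping vector-dot-products (VDPs) of mixed sizes onto one fixed hardware unit size strands multiplier lanes, while letting each VDP pick a better-fitting unit size recovers them. The sizes and candidate unit set below are illustrative, not the paper's configurations.

```python
# Toy utilization model: mapping vector-dot-products (VDPs) of mixed sizes
# onto one fixed unit size strands multiplier lanes, while letting each VDP
# pick a better-fitting unit size recovers them. Sizes are illustrative.
import math

def utilization(vdp_sizes, unit_size):
    """Fraction of provisioned lanes doing useful work at one unit size."""
    used = sum(vdp_sizes)
    provisioned = sum(math.ceil(s / unit_size) * unit_size for s in vdp_sizes)
    return used / provisioned

def best_utilization(vdp_sizes, unit_options):
    """Utilization when each VDP picks its best-fitting unit size."""
    used = sum(vdp_sizes)
    provisioned = sum(min(math.ceil(s / u) * u for u in unit_options)
                      for s in vdp_sizes)
    return used / provisioned

layers = [9, 9, 64, 147, 3]  # mixed VDP sizes, e.g. depthwise vs pointwise
print(utilization(layers, 64))                 # inflexible: one unit size
print(best_utilization(layers, [16, 32, 64]))  # reconfigurable: per-VDP size
```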


More Efficient AI Training With A Systolic Neural CPU


Since modern machine learning came onto the scene, the push has been on to make workloads leveraging the technologies as efficient as possible. Given what these applications are being asked to do – from being a cornerstone of the oncoming autonomous vehicle era to real-time analytics to helping enterprises become more user-friendly, more profitable, and better able to thwart cyberattacks – the speed and efficiency of such techniques as deep neural networks (DNNs) have been at the forefront. That has included leveraging the capabilities of accelerators like GPUs and FPGAs and introducing new devices, such as Amazon Web Services' Trainium devices for AI and machine learning training tasks in the cloud. However, such efforts have only addressed part of the issue, according to Yuhao Ju, a Ph.D. candidate at Northwestern University. What so many of the efforts to make DNN accelerators more efficient have missed is the need to improve the end-to-end performance of the deep learning tasks themselves, where the movement of data is still a significant factor in the overall execution time of the job. Speaking at the recent virtual ISSCC event, Ju said that for high-performing machine learning jobs, a CPU is also needed for pre- and post-processing and data preparation for the accelerator.


CoDR: Computation and Data Reuse Aware CNN Accelerator

Khadem, Alireza, Ye, Haojie, Mudge, Trevor

arXiv.org Artificial Intelligence

Computation and data reuse are critical for resource-limited Convolutional Neural Network (CNN) accelerators. This paper presents Universal Computation Reuse to exploit weight sparsity, repetition, and similarity simultaneously in a convolutional layer. Moreover, CoDR decreases the cost of weight memory access by proposing a customized Run-Length Encoding scheme, and reduces the number of memory accesses to intermediate results by introducing an input- and output-stationary dataflow. Compared to two recent compressed CNN accelerators with the same area of 2.85 mm^2, CoDR decreases SRAM access by 5.08x and 7.99x, and consumes 3.76x and 6.84x less energy.
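
To illustrate the general idea of run-length encoding sparse weights (in the spirit of, but not identical to, CoDR's customized scheme), each nonzero weight can be stored with the count of zeros preceding it:

```python
# Illustrative run-length encoding of pruned weights, in the spirit of (but
# not identical to) CoDR's customized scheme: each nonzero weight is stored
# with the count of zeros preceding it, shrinking weight-memory traffic.
def rle_encode(weights):
    """Encode a weight list as (zero_run_length, value) pairs."""
    pairs, run = [], 0
    for w in weights:
        if w == 0:
            run += 1
        else:
            pairs.append((run, w))
            run = 0
    return pairs, run  # trailing zeros are kept separately

def rle_decode(pairs, trailing_zeros):
    """Reconstruct the dense weight list from the encoded form."""
    out = []
    for run, w in pairs:
        out.extend([0] * run + [w])
    return out + [0] * trailing_zeros

w = [0, 0, 3, 0, 0, 0, -2, 5, 0]
pairs, tail = rle_encode(w)
assert rle_decode(pairs, tail) == w
print(pairs)  # [(2, 3), (3, -2), (0, 5)]
```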


ROMANet: Fine-Grained Reuse-Driven Off-Chip Memory Access Management and Data Organization for Deep Neural Network Accelerators

Putra, Rachmad Vidya Wicaksana, Hanif, Muhammad Abdullah, Shafique, Muhammad

arXiv.org Artificial Intelligence

Enabling high energy efficiency is crucial for embedded implementations of deep learning. Several studies have shown that DRAM-based off-chip memory accesses are among the most energy-consuming operations in deep neural network (DNN) accelerators, and thereby prevent designs from reaching their full efficiency potential. DRAM access energy varies depending upon the number of accesses required as well as the energy consumed per access. Therefore, searching for a solution that minimizes DRAM access energy is an important optimization problem. Towards this, we propose the ROMANet methodology, which aims at reducing the number of memory accesses by searching for the appropriate data partitioning and scheduling for each layer of a network using a design space exploration, based on the knowledge of the available on-chip memory and the data reuse factors. Moreover, ROMANet also targets decreasing the number of DRAM row buffer conflicts and misses by exploiting the DRAM multi-bank burst feature to improve the energy per access. Besides providing energy benefits, our proposed DRAM data mapping also results in increased effective DRAM throughput, which is useful for latency-constrained scenarios. Our experimental results show that ROMANet saves DRAM access energy by 12% for AlexNet, by 36% for VGG-16, and by 46% for MobileNet, while also improving DRAM throughput by 10%, compared to the state-of-the-art.
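
A back-of-the-envelope version of the search ROMANet automates might look like the sketch below: for each candidate output tiling of a layer, estimate the DRAM bytes moved given the on-chip buffer size, and keep the tiling with the least traffic. The layer shape, buffer size, and the simplified 1x1-convolution traffic model are all assumptions for illustration.

```python
# Back-of-the-envelope version of the search ROMANet automates: for each
# candidate output tiling of a layer, estimate the DRAM bytes moved given
# the on-chip buffer size, and keep the tiling with the least traffic.
# Layer shape, buffer size, and the simplified 1x1-conv model are assumed.
import itertools

def dram_bytes(H, W, C, K, t_h, t_w, buffer_bytes, elem=1):
    """Rough DRAM traffic for computing the output in t_h x t_w tiles."""
    tile = (t_h * t_w * C      # input tile (halos ignored for simplicity)
            + C * K            # 1x1 weights, all K filters, refetched per tile
            + t_h * t_w * K) * elem  # output tile
    if tile > buffer_bytes:
        return None            # this tiling does not fit on-chip
    n_tiles = -(-H // t_h) * -(-W // t_w)   # ceiling division
    return n_tiles * tile

H = W = 56; C = K = 64; BUF = 128 * 1024
options = ((t_h, t_w, dram_bytes(H, W, C, K, t_h, t_w, BUF))
           for t_h, t_w in itertools.product([7, 14, 28, 56], repeat=2))
best = min((o for o in options if o[2] is not None), key=lambda o: o[2])
print(best)  # (tile_h, tile_w, total bytes moved)
```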